First report on the lyrics dataset

This first report is an exploration of the lyrics dataset and contains:

  1. Input data description and overview
  2. Data processing steps and methods
  3. Data exploration
  4. Next steps

After preprocessing the data to contain useful features, I will use an analytical approach to find dependencies or correlations, so that I can create a meaningful model based on my findings in the next report.

Input data

The lyrics dataset contains popular songs with lyrics. There are six columns: id, song name, year of publishing, interpreter name, assigned genre and the lyrics. Some rows are musical compositions without lyrics. The names of songs and interpreters use hyphens instead of spaces. Only one of these columns is numeric: the year. The other columns are strings and will be used as categorical variables.

In this section, we will show what the data looks like and try to understand its basics before further processing.

This is what the dataframe looks like:

song_name year interpreter genre lyrics
0 ego-remix 2009 beyonce-knowles Pop Oh baby, how you doing? You know I'm gonna cut right to the chase Some women were made but me, m...
1 then-tell-me 2009 beyonce-knowles Pop playin' everything so easy, it's like you seem so sure. still your ways, you dont see i'm not su...
2 honesty 2009 beyonce-knowles Pop If you search For tenderness It isn't hard to find You can have the love You need to live But if...
3 you-are-my-rock 2009 beyonce-knowles Pop Oh oh oh I, oh oh oh I [Verse 1:] If I wrote a book about where we stand Then the title of my bo...
4 black-culture 2009 beyonce-knowles Pop Party the people, the people the party it's popping no sitting around, I see you looking you loo...
... ... ... ... ... ...
362232 who-am-i-drinking-tonight 2012 edens-edge Country I gotta say Boy, after only just a couple of dates You're hands down, outright blowing my mind I...
362233 liar 2012 edens-edge Country I helped you find her diamond ring You made me try it on and everything Tomorrow you'll both say...
362234 last-supper 2012 edens-edge Country Look at the couple in the corner booth Looks a lot like me and you She's looking out at the wind...
362235 christ-alone-live-in-studio 2012 edens-edge Country When I fly off this mortal earth And I'm measured up by depth and girth The Father says now what...
362236 amen 2012 edens-edge Country I heard from a friend of a friend of a friend that You finally got rid of that girlfriend You fi...

362237 rows × 5 columns

The dataset contains 362237 songs, but only 266557 of them contain any lyrics. This is quite a large amount of data, especially after featurization of the text. Therefore, I will do the basic analysis using all samples, but some analyses and the modelling will use only the songs that contain lyrics.

Next, you can see a few charts showing the distribution of values in individual columns. For better readability, I have replaced "-" in song and interpreter names with spaces and capitalized the first letter of each word.
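The name cleanup described above can be sketched as follows. This is only a minimal illustration, not necessarily the code used for this report; the helper name `prettify_name` is hypothetical.

```python
def prettify_name(raw_name: str) -> str:
    """Replace hyphens with spaces and capitalize the first letter of each word."""
    return raw_name.replace("-", " ").title()


print(prettify_name("beyonce-knowles"))      # Beyonce Knowles
print(prettify_name("it-s-over-now-remix"))  # It S Over Now Remix
```

The second call shows why contractions end up as separate single letters in the cleaned names: the raw name already uses a hyphen where the apostrophe would be.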

We can see that most songs were published in 2006 or 2007, so we should be aware that the results will mostly reflect the music of that period.

The table also suggests that there might be some data quality issues regarding the release year, since there are 10 songs with the year 2038 and also a few from the distant past.

The dataset is quite biased in its genre representation, so the results cannot be easily generalized to all music. We also have 29814 songs where the genre is not available and 23683 songs where the genre is labeled as "Other".

Number of unique interpreters: 18231

   No. of songs  No. of interpreters
0             1                 4514
1             2                 1388
2             3                  832
3            10                  763
4            11                  719

The dataset covers a wide range of interpreters: there are 18231 unique names. A few of them have a lot of songs, but 4514 have only one song and 1388 have only two.
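A table like the one above can be produced by counting songs per interpreter and then counting how often each song count occurs. The sketch below uses a small made-up dataframe; the column names `n_songs` and `n_interpreters` are hypothetical.

```python
import pandas as pd

# Toy data standing in for the real dataframe (values are made up).
df = pd.DataFrame({"interpreter": ["A", "A", "A", "B", "C", "D", "D"]})

# Number of unique interpreters.
n_unique = df["interpreter"].nunique()

# Songs per interpreter, then how many interpreters share each song count.
songs_per_interpreter = df["interpreter"].value_counts()
distribution = (
    songs_per_interpreter.value_counts()
    .rename_axis("n_songs")
    .reset_index(name="n_interpreters")
)

print(n_unique)
print(distribution)
```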

Data processing

We have already made the first processing step: replacing hyphens with spaces and capitalizing first letters. Now we need to take care of data quality issues and then featurize the song lyrics.

1. Data quality

We have already discovered a few data quality issues. Some of the most frequent data quality issues are null values, duplicates, inconsistencies or outliers. I have found no duplicate values in the dataset. So first, we will have a look at columns with null values.

song_name          2
year               0
interpreter        0
genre              0
lyrics         95680
dtype: int64

song_name year interpreter genre lyrics
193957 NaN 2009 Booker T And The Mg S Jazz All right people, the rest of the hard working All star blues brothers are gonna be out here in ...
325992 NaN 2009 Booker T Jazz NaN

We can see that the missing values are almost all missing lyrics. Rows without lyrics will probably not be very useful for later analyses or modelling, but since we want to analyze word counts first, we keep them for now and replace the missing lyrics with an empty string. There are also two rows with a missing song name, shown in the table above. Those seem to be mistakes, so we can drop them.

The table also shows that interpreters sometimes collaborate with others, so we need to be aware that the same artist may appear under several interpreter names.
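The checks and fixes described above could look roughly like this in pandas. It is a sketch on made-up rows, assuming the column names shown in the tables of this report.

```python
import pandas as pd

# Toy rows standing in for the real dataframe (values are made up).
df = pd.DataFrame({
    "song_name": ["Honesty", None, "Amen"],
    "year": [2009, 2009, 2012],
    "lyrics": ["If you search", None, None],
})

# Count missing values per column and fully duplicated rows.
null_counts = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Drop rows without a song name, keep rows without lyrics as empty strings.
cleaned = df.dropna(subset=["song_name"]).copy()
cleaned["lyrics"] = cleaned["lyrics"].fillna("")

print(null_counts)
print(cleaned)
```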

Next, we will have a look at rows with suspicious release years.

song_name year interpreter genre lyrics
27657 Star 702 Clipse Hip-Hop You're my star It's such a wonder how you shine So no matter how far I'm dancing with you in my ...
69708 Anywhere Remix 112 Dru Hill Hip-Hop Here we are all alone You and me, privacy And we can do anything Your fantasy I wanna make your ...
112159 Atchim 2038 Anita Rock
112160 O Areias 2038 Anita Rock
112161 Era Uma Vez Um Cavalo 2038 Anita Rock
112162 Anita 2038 Anita Rock
112163 Todos Os Patinhos 2038 Anita Rock
112164 Joana Come A Papa 2038 Anita Rock
112165 Atirei O Pau Ao Gato 2038 Anita Rock
112166 Eu Vi Um Sapo 2038 Anita Rock
112167 Pipi Das Meias Altas 2038 Anita Rock
112168 Minhoca 2038 Anita Rock
147914 It S Over Now Remix 112 G Dep Hip-Hop What is this? Numbers in your pocket I remember when you Used to throw those things away Why do ...
238541 Come See Me Remix 112 Black Rob Hip-Hop Baby, you can come see me 'cause I need you here with me, and I'll show you what love is made of...
315540 Let S Lurk 67 Giggs Hip-Hop Verse 1: Still pulling up on smoke Skeng in my pocket Can't you see that bulge in my coat Like h...
335205 I Can T Believe 112 Faith Evans Pop [Chorus] I can't believe that love has gone away from me I can't believe that love has gone away...

These rows also seem to be errors. Since we have enough samples, we can exclude them from the dataset as well (most of these rows have no lyrics, so they would be left out of the lyrics-only analyses anyway).

This table revealed another data quality issue: in the names, contractions like "let's" or "can't" appear as "Let S" and "Can T", as if the apostrophe had been a space. However, this should not be much of a problem since it is consistent across interpreter and song names, so we will leave it as is.
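One way to exclude such rows is to keep only years inside a plausible range. The bounds below are assumptions made for this sketch, not values taken from the report: `MIN_YEAR` is an arbitrary lower limit and `MAX_YEAR` is simply the current year.

```python
import datetime

import pandas as pd

MIN_YEAR = 1900                           # assumed lower bound, adjust as needed
MAX_YEAR = datetime.date.today().year     # nothing can be released in the future

# Toy rows standing in for the real dataframe.
df = pd.DataFrame({
    "song_name": ["Star", "Atchim", "Honesty", "Let S Lurk"],
    "year": [702, 2038, 2009, 67],
})

# Boolean mask of rows whose year falls inside the plausible range.
plausible = df["year"].between(MIN_YEAR, MAX_YEAR)
suspicious_rows = df[~plausible]   # inspect these before dropping them
df_clean = df[plausible]

print(suspicious_rows)
```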

Looking at the data also reveals that not all songs are in English. For example, the following interpreter sings in German. We will need to take this into consideration when creating a model, since a single model that determines the interpreter or genre from lyrics will be greatly influenced by the language of the song. However, we will not adjust the data for this yet.

song_name year interpreter genre lyrics
385 Wer Liebe Sucht 2006 Daliah Lavi Not Available Ist das so schwer ein kleines Lächeln wenn du fühlst ein Mann gefällt dir sehr? Dann ist am Aben...
386 Es Geht Auch So 2006 Daliah Lavi Not Available Der Weg den du und ich gegangen führt mit einem Mal in's graue Niemandsland wann hat es angefang...
387 Liebeslied Jener Sommernacht 2006 Daliah Lavi Not Available Rote Schatten warf das Feuer hell wie Gold war der Tokayer als ein Fremder plötzlich vor mir sta...
388 Willst Du Mit Mir Gehn 2006 Daliah Lavi Not Available Willst Du mit mir gehn,Wenn mein Weg in Dunkel führt. Willst Du mit mir gehn, Wenn mein Tag scho...
389 Meine Art Liebe Zu Zeigen 2006 Daliah Lavi Not Available Meine Art Liebe zu zeigen das ist ganz einfach Schweigen. Worte zerstören wo sie nicht hingehöre...

There are also inconsistencies in song lyrics and the style in which they are written. Some contain spelling mistakes, but there is not much we can do about that. We could see in the above tables that some lyrics contain parts like "Verse 1:" or "[Chorus]". This would be useful if it was everywhere, but since it is not, it would be better to remove them. We will remove all parts in [] brackets and add "Chorus" and "Verse" to stopwords used when making a model.
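Removing the bracketed parts can be done with a regular expression. The sketch below is one possible implementation; the names `strip_section_markers` and `EXTRA_STOPWORDS` are hypothetical.

```python
import re

# Matches anything enclosed in square brackets, e.g. "[Chorus]" or "[Verse 1:]".
SECTION_MARKER = re.compile(r"\[[^\]]*\]")

# Words to add to the stopword list later, for markers written without brackets.
EXTRA_STOPWORDS = {"chorus", "verse"}


def strip_section_markers(lyrics: str) -> str:
    """Remove [bracketed] parts and collapse the leftover whitespace."""
    without_markers = SECTION_MARKER.sub(" ", lyrics)
    return " ".join(without_markers.split())


print(strip_section_markers("Oh oh oh I [Verse 1:] If I wrote a book"))
```

Markers written without brackets, such as a plain "Verse 1:", are not touched by this function; those are the cases the extra stopwords are meant to cover.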

2. Featurization

Since words cannot be used for modelling as they are, we will need to create some features to be able to analyze them further. We will come back to featurization in the modelling stage, since right now I cannot be sure which features we will need. For exploration, I will create the following features:

  • Word count: the total number of words in the lyrics
  • Unique word count: the number of unique words in the lyrics
  • Average word length: the average length of the words in the lyrics
  • Word frequencies (TF-IDF featurization will be used in modelling; however, it is not so good for exploration purposes)

       word_count  unique_word_count  average_word_length  word_count_ratio
mean   166.495601          79.431849             4.158632          2.140290
std    167.264722          76.453418             1.062458          1.064717
min      0.000000           0.000000             1.000000          1.000000
25%      0.000000           0.000000             3.812500          1.597403
50%    146.000000          77.000000             4.017065          1.949367
75%    240.000000         111.000000             4.265896          2.444444
max   7914.000000        2746.000000            58.000000         87.000000
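The simple features listed above could be computed per song roughly as follows. This is a sketch under several assumptions: words are split on whitespace (so punctuation stays attached), unique words are compared case-insensitively, and `word_count_ratio`, which the text does not define, is assumed to be the word count divided by the unique word count. The function name `lyric_features` is hypothetical.

```python
import math


def lyric_features(lyrics: str) -> dict:
    """Compute simple word-level features for one song."""
    words = lyrics.split()  # assumption: whitespace tokenization
    word_count = len(words)
    unique_word_count = len({word.lower() for word in words})

    if word_count == 0:
        # Averages and ratios are undefined for songs without lyrics.
        return {
            "word_count": 0,
            "unique_word_count": 0,
            "average_word_length": math.nan,
            "word_count_ratio": math.nan,
        }

    return {
        "word_count": word_count,
        "unique_word_count": unique_word_count,
        "average_word_length": sum(len(word) for word in words) / word_count,
        # Assumed definition: total words per unique word.
        "word_count_ratio": word_count / unique_word_count,
    }


features = lyric_features("Oh oh oh I")
print(features)
```

On a dataframe, the same function could be applied to the lyrics column and the resulting dictionaries expanded into columns.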

The following boxplot shows the distribution of word count and unique word count among genres. Some outliers have been hidden since they would make the chart difficult to read.

This boxplot seems to contain a lot of useful information. For example, we can see that there are genres that are very rarely without lyrics as well as genres that very often have no lyrics. Above all, songs labeled as "Other" rarely have any lyrics. Jazz and Electronic have at least half of their songs with no lyrics. Hip-Hop also has many songs with no lyrics; however, on average, it has the highest number of words as well as unique words. Although there are many words in Pop, there are not many unique ones.
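The observations read from the boxplot can be cross-checked numerically, for example with the share of songs without lyrics and the median word count per genre. The sketch below uses made-up numbers and assumes a `word_count` column that is 0 for songs without lyrics.

```python
import pandas as pd

# Toy data standing in for the real dataframe (values are made up).
df = pd.DataFrame({
    "genre": ["Jazz", "Jazz", "Hip-Hop", "Hip-Hop", "Pop"],
    "word_count": [0, 120, 450, 0, 200],
})

# Per genre: fraction of songs with zero words and the median word count.
summary = df.groupby("genre")["word_count"].agg(
    share_without_lyrics=lambda counts: (counts == 0).mean(),
    median_word_count="median",
)

print(summary)
```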

Not Available will probably not be very useful since it just seems to be an average case.

Word length does not differ very much between genres. Generally, Metal has the longest words and also the largest variance. One explanation that comes to my mind is that it might often be in a different language, but we can come back to this hypothesis in the modelling stage. The average word length is mostly around 4.

This is the frequency of individual words in the lyrics after some basic stopwords were removed. A smaller stopword list was used, with some modifications, since in songs there can be quite a lot of meaning in otherwise meaningless words.
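Counting word frequencies with a reduced stopword list can be sketched as below. The stopword set and the tokenization pattern are assumptions made for illustration, not the list used for the chart; `word_frequencies` is a hypothetical helper.

```python
import re
from collections import Counter

# Deliberately small, illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "it", "is", "chorus", "verse"}

# Letters only, optionally with one apostrophe inside (keeps "can't" together).
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def word_frequencies(lyrics_list, stopwords=STOPWORDS):
    """Count lowercase tokens over several songs, skipping stopwords."""
    counts = Counter()
    for lyrics in lyrics_list:
        tokens = TOKEN.findall(lyrics.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts


frequencies = word_frequencies([
    "I heard from a friend of a friend",
    "Look at the couple",
])
print(frequencies.most_common(3))
```

Grouping the songs by genre before calling the function would give the per-genre frequencies compared in the chart.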


We can see that term frequency does not differ very much between genres. The chart also contains some Spanish words like "que", "de" and "la".

Data exploration

Exploration tips:

  • genres and interpreters with the most songs
  • multi-genre interpreters
  • distribution of song features (word count, number of unique words, etc.)
  • song covers and remakes (same name and lyrics, possibly with small differences)
  • typical words or patterns for various genres and for various interpreters

Next steps

Based on the results, in my next report, I would like to dedicate space to the following topics:

  • creating a representative sample to be able to process data faster
  • separating songs based on their language (either by using a library or by unsupervised clustering): this could significantly improve performance of any other model
  • prediction of song genre/interpreter/year according to lyrics
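As a starting point for the language separation mentioned in the list above, a very rough heuristic is to count how many common function words of each language appear in the lyrics. The sketch below is only an illustration: the word lists are tiny hand-picked assumptions, the function name `guess_language` is hypothetical, and a dedicated language detection library or clustering would likely be more reliable.

```python
import re

# Tiny hand-picked lists of common function words (assumptions, not exhaustive).
COMMON_WORDS = {
    "en": {"the", "and", "you", "i", "to", "a", "it", "me", "my", "in"},
    "de": {"der", "die", "das", "und", "ich", "du", "ist", "nicht", "ein", "mit"},
    "es": {"que", "de", "la", "el", "y", "en", "no", "un", "los", "mi"},
}


def guess_language(lyrics: str):
    """Return the language code with the most common-word hits, or None."""
    tokens = re.findall(r"[^\W\d_]+", lyrics.lower())
    scores = {
        language: sum(token in words for token in tokens)
        for language, words in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("Willst Du mit mir gehn, wenn mein Weg in Dunkel führt"))
print(guess_language("Look at the couple in the corner booth"))
```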